Unsupervised Keyword Extraction from Polish Legal Texts
Abstract
In this work, we present an application of the recently proposed unsupervised keyword extraction algorithm RAKE to a corpus of Polish legal texts from the field of public procurement. RAKE is essentially a language- and domain-independent method. Its only language-specific input is a stoplist containing a set of non-content words. The performance of the method depends heavily on the choice of this stoplist, which should be adapted to the domain. Therefore, we complement the RAKE algorithm with an automatic approach to selecting non-content words, which is based on the statistical properties of term distribution.
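To make the role of the stoplist concrete, the following is a minimal sketch of RAKE-style scoring as the algorithm is commonly described: the text is cut into candidate phrases at stopwords and punctuation, each word is scored by its co-occurrence degree divided by its frequency, and a phrase is scored by the sum of its word scores. The function name `rake`, the toy sentence, and the three-word stoplist are illustrative assumptions; this is not the authors' implementation and it does not include their automatic stoplist selection.

```python
import re
from collections import defaultdict


def rake(text, stoplist):
    """Rank candidate phrases with RAKE-style degree/frequency word scores.

    Hypothetical helper for illustration only.
    """
    stop = {w.lower() for w in stoplist}
    phrases = []
    # Punctuation always ends a candidate phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop:
                # A stopword also ends the current candidate phrase.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)    # number of occurrences of each word
    degree = defaultdict(int)  # co-occurrences within phrases, self included
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy input and stoplist (assumptions, not taken from the paper's corpus).
text = "public procurement law and public contracts, the law of contracts"
stoplist = ["and", "the", "of"]
for phrase, score in rake(text, stoplist):
    print(f"{score:4.1f}  {phrase}")
```

Because phrase boundaries are determined entirely by the stoplist and punctuation, changing the stoplist changes which candidates exist at all, which is why a domain-adapted stoplist matters so much for this method.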
Similar resources
Application of Topic Models to Judgments from Public Procurement Domain
[4] M. Jungiewicz, M. Łopuszyński, Unsupervised keyword extraction from Polish legal texts, In Advances in NLP, 65–70, Springer (2014) [5] M. Łopuszyński, Ł. Bolikowski, Towards robust tags from NLP tools and Wikipedia, Int. Journal of Digit. Libraries (2015) (available online) I acknowledge the support from the SAOS project financed by the National Centre for Research and Development. I acknow...
TextRank: Bringing Order Into Texts
In this paper, we introduce TextRank – a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for keyword and sentence extraction, and show that the results obtained compare favorably with previously published results on established benchmarks.
Keyword Extraction for Text Characterization
Keywords are a valuable means of characterizing texts. In order to extract keywords we propose an efficient and robust, language- and domain-independent approach which is based on small word parts (quadgrams). The basic algorithm can be improved by re-examining and re-ranking keywords using edit distance (i.e. Levenshtein distance) and an algorithm based on the relativistic addition of velocities ...
Resources for Information Extraction from Polish texts
The paper presents a collection of resources developed for Information Extraction (IE) from Polish texts. In particular, we mention two IE platforms adapted to Polish and several IE applications built on top of one of them: named entity recognition, creation of terminology lexicons, and data extraction from medical texts.
Keyword extraction: a review of methods and approaches
This paper presents a survey of methods and approaches for the keyword extraction task. In addition to systematizing the methods, the paper gathers a comprehensive review of existing research. Related work on keyword extraction is elaborated for supervised and unsupervised methods, with special emphasis on graph-based methods as well as Croatian keyword extraction. Selectivity-based keyword extracti...